ABSTRACT


An  investigation  was  made  of  the  feasibility  of  using  computers  to 
assign  the  proper  security  classification  (unclassified,  confidential, 
secret) to textual material. The words in 998 paragraphs were transformed
to  computer-usable  form.  A  set  of  66  variables  was  computed  for  each 
paragraph  by  a  two-stage  process  of  attaching  three  scores  to  a  word  and 
then  combining  the  scores  in  various  ways  over  the  words  of  a  paragraph. 
Several  experiments  were  conducted  to  validate  assumptions  involved  in 
the  method  of  scoring  the  words  and  the  methods  for  combining  the  scores. 
The  66  variables  were  presented  to  a  statistical  technique  which  made  a 
preferential  selection  of  a  small  set  of  effective  variables  from  the 
large  set  of  66  variables.  The  redundant  or  non-controlling  variables 
were  eliminated  from  subsequent  analysis,  and  an  objective  system  was 
developed  for  assigning  security  classifications  using  only  the  selected 
variables.  The  system  was  applied  to  an  independent  sample  of  paragraphs 
and  53.9  percent  were  correctly  classified.  It  was  concluded  that  the 
system  does  exhibit  skill.  However,  the  skill  is  probably  too  low  to 
consider  replacing  the  present  system.  Finally,  it  is  concluded  that  the 
method  for  forming  variables  and  the  statistical  technique,  both  apparently 
new  to  this  field,  show  sufficient  promise  to  merit  application  to  other 
automatic indexing problems.


EVALUATION


The objective of this effort was to determine the feasibility of
using a computer algorithm to assign to paragraphs their proper security
classification (unclassified, confidential, secret).

Multiple discriminant analysis and regression estimation of event
probabilities techniques were used on a dependent sample of paragraphs to
develop the algorithm. An independent sample was used to test the algorithm.

Experimental results have shown that the algorithm has predicted the
proper security classification at a level much higher than could be attained
by chance. However, this level is too low to warrant using the algorithm
in place of present classification methods. The degree of compromise of
classified information is high, as is the degree of overclassification.

Separate analysis of each security category reveals that in 9 out of
12 experiments the algorithm exhibits difficulty in assigning the proper
security classification to confidential material. This indicates that the
confidential category does not contain enough specific information to allow
the algorithm to distinguish it from the other categories.

This study implies that before automatic security classification can
be realized, confidential and secret material should be examined to determine
how much overclassification exists. Human classifiers could reveal
to the researcher what experience factors are used in classifying textual
data. Even if a completely automatic algorithm cannot be developed, it
may prove to be sufficiently accurate to aid the human classifier and reduce
his task.

NICHOLAS M. DiFONDI
Technical Evaluator




SECTION I


INTRODUCTION 

The assignment of security classifications to military and government
publications is a problem because assessing the importance of the contents
of a document is judgmental, time-consuming and potentially expensive.
Overclassification imposes unnecessary handling restrictions, limits the
dissemination of useful information, and impedes the further development of ideas.
Underclassification compromises the very values that classification seeks to
protect. In the face of the "information explosion," which imposes increasing
demands on document monitors, other means of classifying textual material
are being sought.

The feasibility of using computers to assign the proper security
classification (unclassified, confidential, secret) to textual material was
investigated. Statistical analysis of the frequency and distribution of words
within a paragraph led to a computer-automated procedure of security
classification. The procedure was tested for accuracy on independent data by
comparing its security assignments with those made subjectively.




SECTION  II 


PREPARATION  OF  DATA 

The words in 998 paragraphs classified as either unclassified, confidential,
or secret were punched onto IBM cards and then placed onto magnetic
tape. Described below are criteria for selection of paragraphs, rules for
punching data, editing procedures, elimination of function words, and
generation of word pairs.

1.  Selection  of  Paragraphs 

All documents were chosen from the chemical-biological warfare field.
The decision to confine attention to just one field of knowledge is desirable
for two reasons: to reduce the number of different words and to obtain a
homogeneous sample of paragraphs. It is well known in linguistic studies
that for a given number of words, the number of different words increases
rapidly with the number of different fields of knowledge. The smaller
number of different words obtained from a single field is much easier to
handle both from a statistical viewpoint (e.g., the sampling variabilities of
frequently occurring words are less than those of infrequent ones) and from a
data processing viewpoint (less data are easier to process).

It is even more important to have a homogeneous sample of paragraphs.
The overall objective of this study is to discriminate among paragraphs with
different security classifications. This is best accomplished by minimizing
paragraph differences caused by any circumstance other than security
classification, for these can only obscure the differences due to security
classification. Specifically, different fields of knowledge imply different
paragraph content which might overwhelm the differences due to security
classification. The chemical-biological warfare field was chosen because
a number of documents were readily available from previous studies.

It would be desirable to select all paragraphs from one document, which
would provide a homogeneous sample of paragraphs. As this was not possible,
a number of documents were chosen. Documents were selected if each individual
paragraph was classified separately and if the document contained all three
types of paragraphs: unclassified, confidential, and secret. Forty-two
documents were chosen.

Within  each  document  the  selection  of  paragraphs  was  governed  by  the 
following  criteria: 

(a) length between 50 and 200 words; in some cases small consecutive
paragraphs with the same classification were combined to achieve
the minimum length;

(b) none, or very few, formulas, equations, or other non-word
characteristics; and

(c) as far as possible, paragraphs with different classifications
were alternated (i.e., C, S, U, S, C, U, etc.).

Nine hundred ninety-eight paragraphs were chosen: 341 unclassified, 338
confidential, and 319 secret.

2. Card-Punching Rules

There are available a number of sets of rules for card-punching textual
material. However, these were devised for the purpose of semantic and syntactic
analysis and were deemed to be too elaborate for this study. Therefore, a set
of rules was devised and kept as simple as possible to minimize the number of
errors.

At the beginning of each paragraph a header card was punched containing
only the number of the paragraph and its security classification. Punching of
words started in column one of the next card, continued through column 70, then
onto column one of the next card through column 70, etc. One word could
overlap two cards. Characters not appearing on the key-punch keyboard were
ignored. Two periods were used at the end of a sentence. Numbers referring
to footnotes were not punched but exponents of variables were punched.
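
The card layout described above can be illustrated with a short sketch. It is only an illustration in Python, not the program used in the study; the function name `cards_to_text` is hypothetical, and the assumption that anything beyond column 70 can simply be dropped is ours.

```python
def cards_to_text(cards, width=70):
    """Rebuild one paragraph's character string from its text cards.

    Only columns 1..width of each card carry text; anything beyond is
    ignored (an assumption). Cards are joined with no separator, so a
    word that overlaps two cards comes back together.
    """
    return "".join(card[:width].ljust(width) for card in cards)


# Toy example with 10-column "cards" so the wrap is easy to see.
cards = ["THE AGENT ", "IS DISSEMI", "NATED.."]
text = cards_to_text(cards, width=10)
```

Short cards are padded with blanks so that the end of the last card still acts as a word boundary.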

3.  Definition  of  "Word,"  "Word-Pair,"  and  "Function  Word" 

3.1  Word 

All  information  from  a  single  paragraph  was  treated  as  one  long  serial 
string  of  characters.  It  will  be  recalled  from  Section  2.2  that  characters 
that  do  not  appear  on  the  IBM  key-punch  were  not  punched.  The  characters 
that  do  appear  were  separated  into  two  kinds — permissible  and  non-permissible. 
The 18 non-permissible characters are:

    <   Less-than Sign             &   Ampersand            )   Right Parenthesis
    ,   Comma                      >   Greater-than Sign    @   At Sign
    (   Left Parenthesis           !   Exclamation Point    ;   Semicolon
    %   Percent                    :   Colon                '   Prime, Apostrophe
    |   Vertical Bar, Logical OR   *   Asterisk             ¬   Logical NOT
    _   Underscore                 #   Number Sign          "   Quotation Marks

A  "word"  is  defined  as  a  string  of  successive  characters  bounded  by  either 
a  blank,  a  non-permissible  character,  or  an  end-of-sentence  mark. 
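
A minimal sketch of this definition, assuming the end-of-sentence mark is the double period of the card-punching rules and that a single period (as in a decimal number) is a permissible character. The names `NON_PERMISSIBLE` and `split_words` are illustrative, not taken from the study.

```python
import re

# The 18 non-permissible characters listed above.
NON_PERMISSIBLE = "<,(%|_&>!:*#)@;'\u00ac\""

# A word boundary is the end-of-sentence mark "..", a blank,
# or any single non-permissible character.
_BOUNDARY = re.compile(r"\.\.|[\s" + re.escape(NON_PERMISSIBLE) + r"]")


def split_words(text):
    """Return the words of a character string, in order of appearance."""
    return [piece for piece in _BOUNDARY.split(text) if piece]


words = split_words("AGENT (GB), 3.5 MG..THE DOSE")
```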

3.2 Word-Pair

A word-pair is defined here as two consecutive words in the same sentence.
For example, in the sentence GEORGE WASHINGTON CROSSED THE DELAWARE., the pairs
are GEORGE WASHINGTON, WASHINGTON CROSSED, and CROSSED DELAWARE. Note that
THE is eliminated because it is a function word.
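
The example can be reproduced with a few lines of Python. This is a sketch under the assumption, suggested by the example itself, that function words are removed before adjacent words are paired; the function name `word_pairs` is illustrative.

```python
def word_pairs(sentence_words, function_words):
    """Pair consecutive words of one sentence after dropping function words."""
    content = [w for w in sentence_words if w not in function_words]
    return list(zip(content, content[1:]))


pairs = word_pairs(
    ["GEORGE", "WASHINGTON", "CROSSED", "THE", "DELAWARE"],
    function_words={"THE"},
)
```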


3.3 Function Words

Function words include those which are traditionally called articles,
prepositions, pronouns, conjunctions and auxiliary verbs, plus certain irregular
forms. The list below defines the function words of this study. (The reason
for truncation to six letters will be given later.)

        ARE     DOES    PAST    ALONE   UNTIL   EITHER  NEITHE  THINGS
A       BUT     DONE    PLUS    ALONG   WASNT   ELSEWH  NEVERT  THOUGH
I       CAN     DONT    REAL    AMONG   WHERE   ENOUGH  NOBODY  THROUG
AM      DID     DOWN    SAME    APART   WHICH   EVERMO  NOTHIN  TOGETH
AN      ETC     EACH    SELF    ASIDE   WHILE   EVERYO  NOWADA  TOWARD
AS      FEW     ELSE    SOME    BEING   WHOSE   EVERYT  NOWHER  UNDERN
AT      FOR     EVEN    SUCH    BELOW   WOULD   EVERYW  OFTENT  UNDOIN
BE      GET     EVER    THAN    COULD   ACROSS  EXCEPT  OTHERS  UNLESS
BY      GOT     FROM    THAT    DOING   AGAINS  FAIRLY  OTHERW  WHATEV
DO      HAD     GETS    THEM    EVERY   ALREAD  FARTHE  OURSEL  WHENEV
HE      HAS     HAVE    THEN    LATER   ALMOST  FOREGO  OUTSID  WHEREA
IF      HOW     HERE    THEY    LEAST   ALTHOU  FOREVE  OUTWAR  WHEREF
IN      ITS     INTO    THIS    MIGHT   ALWAYS  FORWAR  OVERMU  WHEREI
IS      MAY     JUST    THUS    NEVER   AMOUNT  FURTHE  PERHAP  WHEREV
IT      NOW     KEEP    UNTO    OFTEN   ANOTHE  HARDLY  PLEASE  WHETHE
ME      OUR     KEPT    UPON    OTHER   ANYBOD  HAVING  PRETTY  WITHIN
MY      OWN     LESS    VERY    OUGHT   ANYONE  HEIGHT  RATHER  WITHOU
NO      THE     LEST    WELL    QUITE   ANYTHI  HENCEF  REALLY  YOURSE
OF      TOO     MANY    WERE    RIGHT   ANYWHE  HEREIN  SEVERA
OH      WAS     MINE    WHAT    SHALL   AROUND  HITHER  SHOULD
ON      WAY     MORE    WHEN    SHALT   AWFULL  HOWEVE  SOMEBO
OR      WHO     MOST    WHOM    SINCE   AWHILE  INDEED  SOMEDA
SO      WHY     MUCH    WILL    STILL   BACKWA  INSTEA  SOMETH
TO      YES     MUST    WILT    THEIR   BECAUS  INWARD  SOMETI
UP      YET     NEXT    WITH    THERE   BEFORE  ITSELF  SOMEWH
US      YOU     NONE    YOUR    THESE   BEHIND  LIKEWI  THEIRS
WE      ALSO    ONES    ABOUT   THING   BETWEE  MIDDLE  THEMSE
ALL     AWAY    ONLY    ABOVE   THOSE   BEYOND  MIGHTY  THEREA
AND     BEEN    ONTO    AFTER   TRULY   CANNOT  MOREOV  THEREF
ANY     BOTH    OVER    AGAIN   UNDER   DURING  MYSELF  THEREW

4. Editing Procedures

A  three-stage  procedure  was  followed  to  correct  errors. 

Proofreading. The cards were listed and the listings were proofread.
Errors were corrected by punching new cards.

Misspellings. The cards were placed onto magnetic tape and the words were
extracted using the definition of Section 2.2.1. There were 112,774 words in
all. These were alphabetized by the computer and all words appearing only once
or twice were printed by the computer. The rationale here is that the same
misspelling will not occur more than twice. The printouts were proofread and
misspellings were corrected by repunching cards.
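
The misspelling edit amounts to listing, in alphabetical order, every word that occurs no more than twice. A brief sketch (the function name `rare_words` and the sample words are illustrative only):

```python
from collections import Counter


def rare_words(words, max_count=2):
    """Alphabetized list of words occurring at most max_count times."""
    counts = Counter(words)
    return sorted(w for w, c in counts.items() if c <= max_count)


# "AGNET" is a misspelling of "AGENT"; "DOSE" is a correctly spelled rare word.
candidates = rare_words(["AGENT", "AGENT", "AGENT", "AGNET", "DOSE", "DOSE"])
```

The list still has to be proofread by a person, since correctly spelled but infrequent words appear on it as well.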

Hyphens. Depending upon the author, the same word may or may not be
hyphenated,  e.g.,  anti-aircraft,  antiaircraft.  These  words  presented  a 
special  problem  and  were  examined  carefully.  A  correction  card  eliminating 
the  hyphen  was  punched  for  those  hyphenated  words  which  we  considered  to  be 
two  words.  The  hyphens  within  the  remaining  hyphenated  words  were  eliminated 
by  "squeezing-up"  those  words.  After  correction  for  hyphens  the  misspelling 
edit  was  repeated. 

5.  Form  of  Data 

Function  words  were  eliminated  from  the  edited  data  because  it  was  felt 
that  they  would  not  contribute  to  discriminating  among  unclassified,  confidential, 
or  secret  paragraphs.  The  function  words  constituted  some  42  percent  of  the 
total  number  of  words  and  their  elimination  saved  considerable  computer  time. 

All words were truncated to six letters. In word studies it is desirable,
and it is common practice, to define two words with the same root but different
endings as the same word. However, the rules for so doing are quite elaborate
and it was not deemed worthwhile to write the complicated computer programs
necessary to apply such rules. The truncation is a simple expedient to combine
words with the same root. It does occasionally result in combining two
words with different roots and it does not combine words of less than six
letters. Nevertheless, it is a rather effective substitute for the more
accurate, but much more complicated, procedures now in use.
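
The effect of truncation, including both of its stated weaknesses, can be seen in a small sketch. The function name `truncate` and the example words are illustrative and are not taken from the study's vocabulary.

```python
def truncate(word, length=6):
    """Keep only the first `length` letters of a word."""
    return word[:length]


same_root = {truncate(w) for w in ("CLASSIFIED", "CLASSIFICATION")}
false_merge = truncate("CLASSICAL") == truncate("CLASSIFIED")
short_words = {truncate(w) for w in ("AGENT", "AGENTS")}
```

Here `same_root` holds the single form CLASSI, `false_merge` is true because CLASSICAL collapses to the same six letters despite a different root, and `short_words` keeps AGENT and AGENTS apart because the shorter form has fewer than six letters.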

All words and word-pairs were then placed onto a magnetic tape for
subsequent processing.


SECTION III


STATISTICAL  METHODOLOGY 


Two statistical procedures were used to develop objective methods for
assigning security classifications to paragraphs. The techniques are (1)
stepwise linear regression, and (2) regression estimation of event
probabilities (REEP). The techniques have been described in some detail in other
publications: stepwise regression in [5] and REEP in [4]. Detailed
descriptions of applications of these techniques are available in [7] and [3]. In
this section we simply describe the procedures; their application is considered
in later sections.

The 998 paragraphs were separated into two samples by extracting every
third paragraph. The statistical procedures were applied to the larger, or
developmental, sample, and the resulting methods for assigning security
classifications were tested for accuracy on the independent sample.
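
A sketch of the split, assuming that "every third paragraph" means the third, sixth, ninth, and so on in the stored order; the text does not say which offset was used. The function name `split_samples` is illustrative.

```python
def split_samples(paragraphs):
    """Return (developmental, independent); every third item is independent."""
    independent = paragraphs[2::3]
    developmental = [p for i, p in enumerate(paragraphs) if i % 3 != 2]
    return developmental, independent


# Paragraph numbers 1..998 stand in for the paragraphs themselves.
developmental, independent = split_samples(list(range(1, 999)))
```

Under this assumption the 998 paragraphs divide into 666 developmental and 332 independent paragraphs.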

In both techniques, a stipulated variable, security classification, called
the predictand is the object of estimation. The variables used to make the
estimation of the predictand are called predictors. Both techniques begin with
computation of a "predictor-predictand" matrix as in Table I.

The general entry, $X_{nm}$, in Table I is the value of the m-th predictor
in the n-th paragraph. The formation of predictors is discussed in detail in
Section 4. The predictand variables, the Y's, are dummy variables, i.e.,
variables which can take on only the values zero or one. For example, $Y_{nU}$
takes on a value of one if paragraph n is unclassified and $Y_{nC} = Y_{nS} =$ zero;
$Y_{nC}$ is one if a paragraph is confidential and $Y_{nU} = Y_{nS} =$ zero; and
similarly for $Y_{nS}$ and secret. N is the number of paragraphs in the
developmental sample.

                               TABLE I

                     PREDICTOR-PREDICTAND MATRIX

Paragraph            Predictor Number                     Predictand
 Number      1      2     ...     m     ...     M       U      C      S

    1       X11    X12    ...    X1m    ...    X1M     Y1U    Y1C    Y1S
    2       X21    X22    ...    X2m    ...    X2M     Y2U    Y2C    Y2S
    .
    n       Xn1    Xn2    ...    Xnm    ...    XnM     YnU    YnC    YnS
    .
    N       XN1    XN2    ...    XNm    ...    XNM     YNU    YNC    YNS
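
The dummy coding of the predictand is simple enough to state in a few lines. The sketch below assumes the classification of a paragraph is stored as one of the letters U, C, or S; the names `CLASSES` and `dummy_predictands` are illustrative.

```python
CLASSES = ("U", "C", "S")


def dummy_predictands(label):
    """Return (Y_nU, Y_nC, Y_nS) for a paragraph with the given classification."""
    return tuple(1 if label == c else 0 for c in CLASSES)


row = dummy_predictands("C")  # a confidential paragraph
```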

The number of plausible predictors that might serve to assign security
classifications to paragraphs is very large, if not virtually unlimited. For
example, $X_{nm}$ could be a frequency count of the number of times that word m
occurred in paragraph n, in which case the number of predictors M would be
equal to the number of different words in all N paragraphs. This situation
imposes the practical necessity of selecting a manageable number of predictors.
The statistical techniques, therefore, include provisions for the preferential
selection of effective predictors from a very large set of possible choices
for use in regression or REEP. Substantial previous experimentation comparing
performance on independent data of estimating functions using large numbers
of predictors with those using selectively chosen subsets of such variables
has shown, as a rule, that whatever predictability resides in a large set is
almost wholly contained in the much smaller subset. The objective selection
of such a small subset is termed screening. After screening, the redundant
or non-controlling predictors are eliminated from subsequent analyses, and
a system for assigning security classifications to paragraphs is developed
using only the selected predictors.

6.  Stepwise  Linear  Regression 

In multiple regression, a predictand Y is expressed as a linear function
of a number M of predictor variables $X_m$ (m = 1, 2, ..., M):

$$\hat{Y} = A_0 + A_1 X_1 + A_2 X_2 + \cdots + A_M X_M \qquad \text{(III-1)}$$

where the coefficients $A_m$ (m = 0, 1, ..., M) are determined by least squares.
$\hat{Y}$ can be an estimate of any one of the three Y's of Table I, i.e., the
stepwise technique is applied three times.

As noted above, if M is large, screening is desirable. To select the
first  predictor,  the  simple  linear  correlation  coefficient  is  computed  between 
the  predictand  Y  and  each  of  the  entire  set  of  M  predictors.  The  predictor 
giving  the  best  coefficient  (i.e.,  highest  in  absolute  value)  is  selected  as 
the first predictor. Next, the partial correlation coefficients between each
of the remaining predictors and the predictand (holding the first selected
predictor constant) are examined, and the predictor associated with the best
coefficient is then selected as the second predictor. Additional predictors
are  selected  in  a  similar  manner.  At  each  step  partial  correlations  are 
computed  between  the  predictand  and  each  of  the  remaining  predictors  while 
holding  constant  the  previously  selected  predictors.  The  predictor  associated 
with  the  best  partial  correlation  is  selected.  This  is  equivalent  to  selecting 
that  predictor  which  adds  the  most  independent  predictive  information  to  the 
previously  selected  predictors.  At  each  step  a  test  is  made  to  see  if  the  new 
predictor  selected  adds  a  satisfactory  amount  of  additional  information.  When 
the  test  fails  selection  is  halted.  A  multiple  regression  is  then  computed 
between  the  variable  to  be  predicted  and  the  small  set  of  selected  predictors. 
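
The screening loop can be sketched as follows. This is an illustration in plain Python, not the program used in the study. The text does not state the test applied at each step, so the stopping rule here (a hypothetical `min_partial_r2` threshold on the squared partial correlation) is an assumption, as are the function names. Partial correlations are obtained by sweeping each selected predictor out of the predictand and out of the remaining predictors, then correlating the residuals.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def screen_predictors(columns, y, min_partial_r2=0.05, tol=1e-9):
    """Forward selection by largest absolute (partial) correlation.

    columns: dict mapping predictor name -> list of values (one per paragraph)
    y:       list of predictand values
    Returns the predictor names in the order selected.
    """
    resid = {name: _center(col) for name, col in columns.items()}
    ry = _center(y)
    selected = []
    while resid:
        syy = _dot(ry, ry)
        best, best_r = None, 0.0
        for name, rx in resid.items():
            sxx = _dot(rx, rx)
            if sxx < tol or syy < tol:
                continue  # nothing left to explain, or a redundant predictor
            r = _dot(rx, ry) / math.sqrt(sxx * syy)
            if abs(r) > abs(best_r):
                best, best_r = name, r
        # Assumed stopping test: the new predictor must add enough information.
        if best is None or best_r ** 2 < min_partial_r2:
            break
        rs = resid.pop(best)
        sss = _dot(rs, rs)

        def sweep(v):
            b = _dot(v, rs) / sss
            return [vi - b * si for vi, si in zip(v, rs)]

        # Hold the selected predictor constant in everything that remains.
        ry = sweep(ry)
        resid = {name: sweep(v) for name, v in resid.items()}
        selected.append(best)
    return selected


x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x3 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
y = [2 * a + 3 * b for a, b in zip(x1, x3)]
chosen = screen_predictors({"x1": x1, "x3": x3, "x1_copy": list(x1)}, y)
```

In this toy case the redundant predictor `x1_copy` is never selected, because once `x1` is held constant it carries no independent information.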

The stepwise regression technique will result in three equations, one
for each security classification. In applying the equations to the test sample
of paragraphs, the largest $\hat{Y}$ gives the security classification assignment.
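
The assignment rule can be written out as a brief sketch. The coefficient values below are invented purely for illustration, and the names `evaluate` and `assign_classification` are hypothetical.

```python
def evaluate(coefficients, x):
    """Evaluate A_0 + A_1*X_1 + ... + A_M*X_M for one paragraph."""
    return coefficients[0] + sum(a * xi for a, xi in zip(coefficients[1:], x))


def assign_classification(equations, x):
    """Pick the classification whose regression estimate is largest."""
    estimates = {label: evaluate(coef, x) for label, coef in equations.items()}
    return max(estimates, key=estimates.get)


# Made-up coefficients [A_0, A_1, A_2] for two selected predictors.
equations = {
    "U": [0.50, -0.25, 0.00],
    "C": [0.25, 0.00, 0.25],
    "S": [0.25, 0.25, -0.25],
}
label = assign_classification(equations, [2.0, 1.0])
```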

7.  Regression  Estimation  of  Event  Probabilities  (REEP) 

The REEP technique is much like stepwise regression but differs in two
important respects: (1) the use of dummy predictors exclusively, and (2) the
simultaneous consideration of all three predictand dummy variables (the three
Y's of Table I) rather than piecemeal consideration.

The first step in the REEP procedure is to transform each predictor
variable to a set of "dummy variables." A dummy variable is a variable
which can take on only two values, zero or one. An example illustrates
the procedure for continuous variates. If X is a continuous variable,
it can be divided into G ranges by specifying G-1 class limits,
$X_1, X_2, \ldots, X_{G-1}$, where

$$-\infty < X \le X_1,\quad X_1 < X \le X_2,\quad \ldots,\quad X_{G-1} < X < +\infty .$$

A set of G dummy variables is generated by finding the range which encloses
a specific X-value and assigning a one (1) to the proper dummy variable and
zero (0) to the remaining G-1 dummy variables. This procedure is repeated
for all N values of X. Qualitative variables are transformed into dummy
predictors in a manner similar to that indicated in Table I for the
predictand "security classification." This permits qualitative variables to
be incorporated into the REEP procedure in a natural and easy manner.
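
The transformation of a continuous variate can be sketched with the standard library. The function name `to_dummies` and the class limits in the example are illustrative; the ranges are closed on the right, matching the inequalities above.

```python
from bisect import bisect_left


def to_dummies(x, limits):
    """Map a value to G dummy variables given G-1 ascending class limits.

    Range g is limits[g-1] < x <= limits[g]; the first range is open below
    and the last is open above.
    """
    g = bisect_left(limits, x)
    return [1 if i == g else 0 for i in range(len(limits) + 1)]


limits = [0.0, 1.0]               # G - 1 = 2 class limits, so G = 3 ranges
inside = to_dummies(0.5, limits)
on_limit = to_dummies(1.0, limits)
```

A value equal to a class limit falls in the lower of the two adjoining ranges, so `on_limit` is the same as `inside`.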

The selection of predictors is made by first computing the simple linear
correlation coefficient between each dummy predictor and each dummy predictand,
or 3 times M coefficients in all. The highest coefficient of this entire
set gives the first predictor. Next, the partial correlation coefficients
between each of the remaining predictors and each of the dummy predictands
are examined and the highest one gives the second predictor. Additional
predictors are selected in a similar manner. Finally, three regression
equations are computed, one between each of the three predictand variables
and the final set of selected predictors. Application of the equations to
the test sample of paragraphs will give the probabilities that a paragraph
belongs to each of the three security classifications.

SECTION IV


FORMATION  OF  PREDICTOR  VARIABLES 

The basic assumption underlying this study is that the words of a paragraph
contain information for determining its security classification. To
extract such information, 66 predictor variables which potentially contain
information were computed. These were arranged in a matrix as indicated in
Table I and the two statistical techniques were applied to select in an
objective manner those predictor variables which actually contain information
for discriminating among the three types of paragraphs.

Three decisions discussed previously impact on the formation of predictor
variables: (a) function words were eliminated, (b) word-pairs are considered
as words, and (c) the developmental sample of paragraphs is used for generating
the matrix.


Predictor  variables  were  formed  by  a  two-stage  process  of  first  attaching 
three  scores  to  each  word  and  then  combining  the  scores  over  the  words  in  a 
paragraph.  The  combination  was  done  in  a  variety  of  ways  resulting  in  a  number 
of  predictor  variables.  In  this  section  we  describe  the  method  of  forming  the 
predictor  variables.  Their  use  is  discussed  in  subsequent  sections. 

8.  Word  Scores 

Three scores were assigned to each word. As a first step in devising the
scores, it was assumed that it is simply the appearance or non-appearance of a
word in a paragraph, rather than the number of times it appears, which serves
to determine the security classification of the paragraph. Consequently, a
count was obtained of the number of different paragraphs in which each word
appears. This number was designated as $N_j$ for the j-th word. Some of these
$N_j$ paragraphs were unclassified, some confidential, and some secret. These
three quantities were denoted as $N_{ju}$, $N_{jc}$, and $N_{js}$, where

$$N_j = N_{ju} + N_{jc} + N_{js} \qquad \text{(IV-1)}$$

The convention was adopted that an individual word will be counted each time
it appears whether alone or as part of a word-pair; e.g., if word j appears
eight times alone and four times as part of a word-pair, denoted as word k,
then $N_j = 12$ and $N_k = 4$.

If it is assumed that word j does not offer any information whatsoever
about  the  security  classification  of  paragraphs,  then  the  number  of  different 
unclassified  paragraphs  in  which  word  j  should  appear  is  proportional  to  the 
total  number  of  unclassified  paragraphs.  (Such  an  assumption  is  termed  the 
null  hypothesis  in  statistical  decision  theory.)  Similarly,  the  number  of 
different  confidential  and  secret  paragraphs  in  which  word  j  should  appear  is 
proportional to the total number of such paragraphs. Mathematically,

Intuitively, the scores appear reasonable. Their numerators are deviations
of actual values from expected values. Thus, if $N_{js}$ exceeds $E_{js}$ by a large
amount then word j appears in secret paragraphs much more often than expected
by chance. Such a word should be useful for distinguishing between secret
and other types of paragraphs. The denominators are factors which take into
account the number of paragraphs in which word j appears. This is required
since a large value of $(N_{js} - E_{js})$ does not mean the same thing if word j
appears in 50 different paragraphs ($N_j = 50$), as it does if word j appears
in only four paragraphs ($N_j = 4$).

Statistically, each S-value is a "chi variate" which approximates a unit
normal deviate as $E$ increases. Such approximations have desirable properties;
in particular, their sum is a meaningful quantity. (An example of non-meaningful
quantities is the sum of the numerators alone or the sum of ratios such as
$N_{js}/N_j$.) Considerable effort has been devoted in the field of statistical
theory to the minimum value that $E$ can assume for S to have the desirable
properties. For the case considered here, where $E_{ju}$, $E_{jc}$, and $E_{js}$ are
approximately equal because $K_u \approx K_c \approx K_s$, it has been found that
$E \ge 1.0$ is satisfactory. Therefore, only those words for which $N_j$ is at
least four (4) have scores attached to them.
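
The scoring can be sketched under two stated assumptions, since the defining equations are not reproduced here: that the expected count is $E_{ja} = N_j K_a / K$, where $K_a$ is the number of paragraphs of classification a and $K$ is their total, and that the score takes the usual chi form $S_{ja} = (N_{ja} - E_{ja}) / \sqrt{E_{ja}}$. Both formulas are inferred from the verbal description (deviation from expectation in the numerator, a chi variate approaching a unit normal deviate) and may differ in detail from those of the study. The function name `word_scores` is illustrative.

```python
import math


def word_scores(n_ju, n_jc, n_js, k_u, k_c, k_s, min_paragraphs=4):
    """Return (S_ju, S_jc, S_js), or None if the word is in too few paragraphs.

    n_ja: number of different paragraphs of class a containing word j
    k_a:  total number of paragraphs of class a in the developmental sample
    """
    n_j = n_ju + n_jc + n_js              # equation (IV-1)
    if n_j < min_paragraphs:
        return None                       # no scores attached to rare words
    k = k_u + k_c + k_s
    scores = []
    for n_ja, k_a in ((n_ju, k_u), (n_jc, k_c), (n_js, k_s)):
        e_ja = n_j * k_a / k              # ASSUMED expected count (null hypothesis)
        scores.append((n_ja - e_ja) / math.sqrt(e_ja))  # ASSUMED chi form
    return tuple(scores)


# A word found in 12 paragraphs, 10 of them secret, with equal class sizes.
scores = word_scores(1, 1, 10, k_u=100, k_c=100, k_s=100)
```

With equal class sizes the expected count is 4 in each class, so the secret score is strongly positive and the other two are negative.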

9.  Combining  of  Scores 

There were three different combination methods, and each method resulted
in a number of predictor variables.

9.1  Means 


The scores were summed in various ways over the usable words of a paragraph.
A usable word has three characteristics: (a) it is not a function word, (b) it
appears in at least four different paragraphs in the developmental sample of
data, and (c) it has not appeared previously in the same paragraph.
Characteristic (c) is necessary to assure that a word appearing more than once in a
paragraph will have its scores summed only once. To account for the varying
number of usable words in the paragraphs, each sum is divided by the total
number of usable words, denoted by $L_i$ for the i-th paragraph.

The  first  three  predictors  are  the  arithmetic  means  of  the  three  scores 
over  all  usable  words  of  the  paragraph: 


$$MU100_i = \Big(\sum S_{ju}\Big)/L_i$$

$$MC100_i = \Big(\sum S_{jc}\Big)/L_i \qquad \text{(IV-4)}$$

$$MS100_i = \Big(\sum S_{js}\Big)/L_i$$

(The  reason  for  the  notation  is  given  later.) 

Other means were also computed. The rationale for such means can be
illustrated by an example. Consider a paragraph with a few large positive values
of $S_s$ but many small negative values of $S_s$. The sum, $MS100_i$, could easily be
negative. Yet, logically, it is quite conceivable that there be just a few
strong words which make a paragraph secret and the remainder of the words in
the paragraph may not be important. To take account of such possibilities,
additional means are computed:

$$MU\text{--}_i = \Big(\sum A^{(k)}_{ju} S_{ju}\Big)/L_i$$

$$MC\text{--}_i = \Big(\sum A^{(k)}_{jc} S_{jc}\Big)/L_i \qquad \text{(IV-5)}$$

$$MS\text{--}_i = \Big(\sum A^{(k)}_{js} S_{js}\Big)/L_i$$

where

$$A^{(k)}_{ja} = \begin{cases} 1 & \text{if } S_{ja} > k \\ 0 & \text{if } S_{ja} \le k \end{cases}$$

and a can be u, c, or s. Equations (IV-5) state that the scores of the words
of paragraph i are summed only for those cases for which the score exceeds k.
For example, with k = 0 only positive values of S are summed.
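
The case described above, a few strong words among many weak ones, can be worked through numerically. The scores are invented for illustration and the function name `threshold_mean` is hypothetical; the paragraph is assumed to have at least one usable word.

```python
def threshold_mean(scores, k=None):
    """Sum the scores exceeding k and divide by the number of usable words.

    scores: one type of score (e.g. the S_js values) for the usable words of
            a paragraph; with k=None every score is summed, as in (IV-4).
    """
    l_i = len(scores)
    kept = scores if k is None else [s for s in scores if s > k]
    return sum(kept) / l_i


# One strongly "secret" word among seven mildly negative ones.
s_values = [3.0, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5]
plain_mean = threshold_mean(s_values)        # all scores summed
positive_mean = threshold_mean(s_values, k=0)  # only scores above zero summed
```

The plain mean is slightly negative while the mean with k = 0 is clearly positive, so the thresholded variable preserves the evidence of the one strong word.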

It is not desirable to assume which value of k would result in predictor
variables containing the maximum amount of information for discriminating
among the three types of paragraphs. Therefore, k was set equal to a number
of different values. Insertion of each k in equations (IV-5) results in three
predictor variables.

To develop a reasonable set of k-values, advantage was taken of the fact
that the scores, the S-values, are approximately normally distributed. Cox [2]
and Bryan and Southan [1] have developed a method for the optimum subdivision
of a normally distributed variable. Their method was applied as follows: The
$S_u$ scores for all words in all paragraphs constitute a normally distributed
variable. Similarly, $S_c$ is a normal variate and so is $S_s$. The three variables
are treated separately. A variable is arrayed from lowest to highest. The
array is examined to choose class limits which subdivide the variable in the
manner indicated in Table II. The class limits are the k-values.


                              TABLE II

                 SAMPLE ARRAY TO CHOOSE CLASS LIMITS

                                   Percentage of Scores
          Class Limit             Higher than Class Limit

           $k_{98}$                         98
           $k_{91}$                         91
           $k_{80}$                         80
           $k_{66}$                         66
           $k_{50}$                         50
           $k_{34}$                         34
           $k_{20}$                         20
           $k_{09}$                          9
           $k_{02}$                          2
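
A minimal sketch of how class limits could be read off an ordered array of scores so that the percentages of Table II hold. The function name and the rounding rule are assumptions of this sketch, not the report's procedure.

```python
from typing import Dict, Sequence

# Percentages of scores that should lie above each class limit (Table II).
PERCENT_ABOVE = (98, 91, 80, 66, 50, 34, 20, 9, 2)


def class_limits(scores: Sequence[float],
                 percents: Sequence[int] = PERCENT_ABOVE) -> Dict[int, float]:
    """Return, for each percentage p, a score value k such that roughly
    p percent of the scores are higher than k."""
    ordered = sorted(scores)                      # array from lowest to highest
    n = len(ordered)
    limits = {}
    for p in percents:
        n_at_or_below = round(n * (100 - p) / 100)
        index = min(max(n_at_or_below, 1), n) - 1  # keep the index in range
        limits[p] = ordered[index]
    return limits


# Invented scores 1..100, so the percentages can be checked by eye.
scores = list(range(1, 101))
limits = class_limits(scores)
```

For the invented array above, `limits[98]` is 2 and `limits[2]` is 98, so 98 and 2 scores respectively exceed those limits.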
Use of the k-values obtained in this way resulted in 27 variables, 9
for each of the three types of scores. The notation of the variables is
$\mathrm{MU98}_i$, $\mathrm{MC98}_i$, $\mathrm{MS98}_i$ when $k_{98}$ is used
in equations (IV-5); $\mathrm{MU91}_i$, $\mathrm{MC91}_i$, $\mathrm{MS91}_i$
when $k_{91}$ is used, etc.

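9.2 Frequencies

Frequency predictor variables are defined in precisely the same manner
as Mean variables except that the S-values are counted instead of being
summed. Specifically, the sum $\sum_j A^{(k)}_{ja} S_{ja}$ of equations (IV-5)
is replaced by the count $\sum_j A^{(k)}_{ja}$, where, as before,

$$
A^{(k)}_{ja} =
\begin{cases}
1 & \text{if } S_{ja} > k \\
0 & \text{if } S_{ja} \le k
\end{cases}
$$

and the summations are made over the usable words of paragraph i.

Twenty-seven predictors were computed by using the 9 k-values listed in
Table II: $\mathrm{FU98}_i$, $\mathrm{FC98}_i$, $\mathrm{FS98}_i$, ...,
$\mathrm{FU02}_i$, $\mathrm{FC02}_i$, $\mathrm{FS02}_i$.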
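
The counting step can be sketched as follows. The scores and cutoffs are invented, and whether the count is further divided by $L_i$ is not recoverable from the text here, so the sketch returns the raw count.

```python
from typing import Sequence


def threshold_count(scores: Sequence[float], k: float) -> int:
    """Count the usable-word scores of one paragraph that exceed k."""
    return sum(1 for s in scores if s > k)


# Invented S_u scores for the usable words of one paragraph.
scores = [2.5, 0.3, -0.4, 1.1, -2.0]

count_above_zero = threshold_count(scores, 0.0)
count_above_one = threshold_count(scores, 1.0)
```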

9.3  Highest  Value  Sums 

There  are  nine  such  predictor  variables  defined  as  follows: 

$\mathrm{HU1}_i$ = largest $S_u$ value in paragraph i.

$\mathrm{HU3}_i$ = sum of three largest values in paragraph i.

$\mathrm{HU5}_i$ = sum of five largest values in paragraph i.

Similar  definitions  hold  for  HC-  and  HS-. 
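
As a small illustration of these definitions, the sums of the n largest scores can be computed with the standard-library `heapq.nlargest`. The scores below are invented.

```python
import heapq
from typing import Sequence


def highest_value_sum(scores: Sequence[float], n: int) -> float:
    """Sum of the n largest scores; uses all scores if fewer than n exist."""
    return sum(heapq.nlargest(n, scores))


# Invented S_u scores for the usable words of one paragraph.
scores = [2.5, 0.3, -0.4, 1.1, -2.0, 0.9]

hu1 = highest_value_sum(scores, 1)
hu3 = highest_value_sum(scores, 3)
hu5 = highest_value_sum(scores, 5)
```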
9.4 Summary

The 66 predictors are listed in Table III.


                             TABLE III

                        PREDICTOR VARIABLES

               MU100          MC100          MS100
               MU98           MC98           MS98
               MU91           MC91           MS91
               MU80           MC80           MS80
               MU66           MC66           MS66
               MU50           MC50           MS50
               MU34           MC34           MS34
               MU20           MC20           MS20
               MU09           MC09           MS09
               MU02           MC02           MS02
               FU98           FC98           FS98
               FU91           FC91           FS91
               FU80           FC80           FS80
               FU66           FC66           FS66
               FU50           FC50           FS50
               FU34           FC34           FS34
               FU20           FC20           FS20
               FU09           FC09           FS09
               FU02           FC02           FS02
               HU1            HC1            HS1
               HU2            HC2            HS2
               HU3            HC3            HS3
SECTION V


EXPERIMENTS

Four experiments, Sections 10 through 13, were completed to test some
assumptions made in developing the word-scores and their combination. The
paragraphs were separated into two samples: a dependent sample for developing
computer methods of assigning security classifications to paragraphs, and an
independent sample for testing the accuracy of the methods. An experiment
(Section 14) was made to study the optimum size of the dependent sample. In
conducting this experiment, it was noted that there was quite a loss in
accuracy from the dependent to the independent samples. A procedure (see
Section 15) was devised and tested to reduce this loss in accuracy. One
of the two statistical techniques was used to develop an automated computer
method for assigning security classifications (see Sections 16 and 17).

10.  Individual  Words  as  Predictors 

In previous work in development of automatic indexing procedures (e.g.,
[8]), individual words are used as predictors instead of being scored and then
collectively combined. For example, the presence or absence of a specified
word in a paragraph could constitute a dummy variable predictor by assigning
a one for presence and a zero for absence. Alternatively, the predictor could
be the number of times that a word appears in a paragraph. A full-scale
investigation of the relative efficacy of such predictors as compared to
score-type predictors is beyond the scope of this study. However, some useful
information could be obtained in a simple and straightforward manner.

There were 1741 different individual words (exclusive of function words)
and 643 different word-pairs appearing in four or more paragraphs. For each of
these 2384 "words," frequency counts were made of the total number of
paragraphs the word appeared in and the number of such paragraphs which were
unclassified, confidential, and secret. These quantities are the $N_j$,
$N_{ju}$, $N_{jc}$, and $N_{js}$ defined in Section 8.

A detailed but entirely subjective examination of these frequencies was
made. Many of the words appeared infrequently. Such words cannot be expected
to be good predictors simply because most paragraphs will not contain them.
Approximately 150 words appearing in 50 or more paragraphs (the frequently
occurring words) did not appear to occur more frequently in one type of
paragraph than in another (i.e., $N_{ju}$, $N_{jc}$, $N_{js}$ did not differ
radically). Such words are not useful predictors by themselves.

It was concluded that summing of scores over all usable words of a
paragraph would tend to accumulate the small amounts of information in each
word and, therefore, would be more likely to prove to be a good prediction
method. However, the issue is not closed; and it is recommended that a test
be made to determine whether individual word predictors add any information
to score-type predictors.

11. Multiple Word Occurrence in a Paragraph

The first occurrence only of a word in a paragraph was used to compute
predictor variables. (See Section 9.1 for definition of a usable word.)
This assumes that subsequent occurrences contribute no information for
assigning a security classification to that paragraph. An experiment was
performed to test this assumption.

The  test  was  made  by  computing  a  small  set  of  predictor  variables  using 
first  occurrences  only  and  another  similar  set  using  all  occurrences.  The 
two  sets  were  then  used  to  assign  security  classification  to  paragraphs  and 
the  assignments  were  verified  by  comparison  with  the  actual  classifications. 
Three predictor variables, MU100, MC100, and MS100, were obtained by
summing scores for usable words; i.e., first occurrences only are included in
the sums. (See equations (IV-4) for definitions of the variables.) Three new
predictor variables were computed in a similar manner except that scores
for all occurrences of a word were included in the sums. The six variables
were computed for each of the 998 paragraphs.

Using the 666 paragraphs of the dependent sample, mean values of each
variable were computed. Deviations from mean values were then computed for
all 998 paragraphs. To illustrate the computations, consider the variable
MU100, of which 998 values were computed by equations (IV-4). The mean is

$$
\overline{\mathrm{MU100}} = \Bigl(\sum_i \mathrm{MU100}_i\Bigr) / 666, \qquad \text{(V-1)}
$$

where the summation is over all paragraphs in the dependent sample. The
deviations are

$$
\mathrm{mu100}_i = \mathrm{MU100}_i - \overline{\mathrm{MU100}}; \qquad i = 1, 2, \ldots, 998. \qquad \text{(V-2)}
$$

The largest value of $\mathrm{mu100}_i$, $\mathrm{mc100}_i$, and
$\mathrm{ms100}_i$ determines the assignment of a security classification to
paragraph i. Assignments were made in this way to all 998 paragraphs.
Verification of the assignments on the dependent and independent samples for
the two sets of variables is shown in Table IV.
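
The assignment rule of equations (V-1) and (V-2) can be sketched as below. The numbers are invented and the function names are hypothetical; only the rule itself (subtract the dependent-sample mean, take the largest deviation) comes from the text.

```python
from typing import Dict, Sequence

LABELS = ("U", "C", "S")  # stand for MU100, MC100, MS100


def dependent_means(rows: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Mean of each variable over the dependent sample, as in (V-1)."""
    return {lab: sum(r[lab] for r in rows) / len(rows) for lab in LABELS}


def assign(row: Dict[str, float], means: Dict[str, float]) -> str:
    """Assign the class whose deviation from the mean, as in (V-2), is largest."""
    deviations = {lab: row[lab] - means[lab] for lab in LABELS}
    return max(LABELS, key=lambda lab: deviations[lab])


# Invented dependent-sample values of the three variables.
dependent = [
    {"U": 1.0, "C": 0.0, "S": -1.0},
    {"U": -1.0, "C": 2.0, "S": 1.0},
    {"U": 0.0, "C": 1.0, "S": 3.0},
]
means = dependent_means(dependent)

first = assign({"U": 0.5, "C": 1.2, "S": 2.0}, means)
second = assign({"U": 0.4, "C": 1.1, "S": 0.9}, means)
```

In the second invented paragraph the raw value of the C variable is the largest, but after the means are subtracted the U deviation is largest, so the assignment is U.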

The accuracy of the assignments was measured by the proportion of "hits,"
which is the sum of the elements on the main diagonal divided by the total
number of assignments. On the dependent sample, there were 90.3 percent hits
for the single-occurrence assignments and 86.6 percent for the assignments
made by the multiple-occurrence variables. On the independent sample, the
corresponding figures are 51.2 percent versus 49.6 percent. Thus, on both
samples the single-occurrence type variables resulted in better assignments.


                               TABLE IV

           COMPARISON OF SINGLE VS. MULTIPLE WORD OCCURRENCE

                           Dependent Sample

(a) Single Occurrence

                       Actual
     Assigned      U      C      S    Total
        U        199     16     13     228
        C          7    202      5     214
        S          5     18    201     224
      Total      211    236    219     666

     % Hits = 602/666 = 90.3

(b) Multiple Occurrence

                       Actual
     Assigned      U      C      S    Total
        U        187     10     10     207
        C         13    184      3     200
        S         11     42    206     259
      Total      211    236    219     666

     % Hits = 577/666 = 86.6

                          Independent Sample

(c) Single Occurrence

                       Actual
     Assigned      U      C      S    Total
        U         72     35     21     128
        C         31     40     20      91
        S         28     27     58     113
      Total      131    102     99     332

     % Hits = 170/332 = 51.2

(d) Multiple Occurrence

                       Actual
     Assigned      U      C      S    Total
        U         62     27     14     103
        C         31     37     19      87
        S         38     38     66     142
      Total      131    102     99     332

     % Hits = 165/332 = 49.6


It is concluded, therefore, that subsequent occurrences of a word in a
paragraph do not contribute any information for assigning security
classifications.
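
The "hits" measure can be reproduced from a verification table in a few lines. The sketch below uses the independent-sample entries of Table IV (rows are assigned classes, columns are actual classes); the function name is hypothetical.

```python
from typing import Sequence


def percent_hits(table: Sequence[Sequence[int]]) -> float:
    """Main-diagonal sum divided by the total number of assignments, in percent."""
    total = sum(sum(row) for row in table)
    hits = sum(table[i][i] for i in range(len(table)))
    return 100.0 * hits / total


# Independent sample, Table IV (c) and (d): rows = assigned U, C, S.
single_occurrence = [[72, 35, 21], [31, 40, 20], [28, 27, 58]]
multiple_occurrence = [[62, 27, 14], [31, 37, 19], [38, 38, 66]]

single_pct = percent_hits(single_occurrence)      # 170 hits out of 332
multiple_pct = percent_hits(multiple_occurrence)  # 165 hits out of 332
```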

12.  Word-Pairs 

An experiment was done to determine whether word-pairs contribute
discriminating information. The procedure was quite similar to the one just
described. Two small sets of variables, one with and one without word-pairs,
were used to assign security classifications to paragraphs. The assignments
were then verified and compared. The variables with word-pairs are the
mu100, mc100, and ms100 defined in the previous section. A new set of three
variables was obtained by summing scores over single words of a paragraph,
not including word-pairs. As before, deviations from means were computed,
and the highest of the three deviations for a paragraph governed the security
assignment.

The assignments were verified on the independent sample only and are
presented in Table V. The percentage of hits is 50.9 without word-pairs
and 51.2 with word-pairs. Although the rise is rather small, it is
concluded that word-pairs do contribute a small amount of discriminating
information over and above that contributed by single words.

13.  Four  or  More  Paragraphs 

Predictor variables are obtained by considering the scores of usable
words only (see Sections 9.1 and 9.2). One characteristic of a usable
word is that it must have appeared in at least four paragraphs. Four was
chosen because it is the minimum number necessary for S, defined by equations
(IV-3), to have certain desirable statistical features. A test was made to see
if requiring more than four paragraphs would result in better predictors.
                               TABLE V

                  TEST OF USEFULNESS OF WORD-PAIRS

(a) Word-Pairs

                       Actual
     Assigned      U      C      S    Total
        U         72     35     21     128
        C         31     40     20      91
        S         28     27     58     113
      Total      131    102     99     332

     % Hits = 170/332 = 51.2

(b) No Word-Pairs

                       Actual
     Assigned      U      C      S    Total
        U         70     37     21     128
        C         32     38     17      87
        S         29     27     61     117
      Total      131    102     99     332

     % Hits = 169/332 = 50.9



Again the procedure was similar to that described in Section 11. Three
small sets of variables were obtained: one using the four-paragraph criterion,
a second using 10 paragraphs as the criterion, and a third with 15 paragraphs.
For each criterion three variables were computed as before. The four-paragraph
variables are the same mu100, mc100, and ms100 used in Sections 11 and 12.
Two new sets of variables were obtained by redefining a usable word to
include (a) only those words appearing in 10 or more paragraphs, and (b) only
words appearing in 15 or more paragraphs. New MU100, MC100, and MS100
variables were computed, and deviations from their means were used to assign
security classifications. The verifications on the independent sample of
data are presented in Table VI. The percentage of hits was 51.2 for the
four-paragraph criterion, 51.2 for 10 paragraphs, and 50.6 for 15 paragraphs.
Since neither of the other criteria gave results better than the
four-paragraph criterion, the latter was retained.

14.  Size  of  Sample 

How  many  paragraphs  are  required  in  the  developmental  sample  to  develop 
an  objective  method  of  assigning  classifications?  It  is  well-known  that  use 
of  large  samples  would  result  in  more  accurate  assignments.  On  the  other 
hand,  the  original  subjective  assignment  of  security  classifications  to  a 
developmental  sample  of  paragraphs  and  the  collection  and  processing  of  such 
data  is  quite  costly.  Thus,  the  fewer  the  number  of  paragraphs  required  the 
less  the  cost.  This  conflict  was  deemed  sufficiently  important  to  justify 
a  more  elaborate  experiment  than  the  four  described  in  Sections  10  to  13. 

Three developmental samples were generated, one being twice the size of
the other two. The first consisted of the 666 paragraphs obtained by
eliminating every third paragraph from the entire sample of 998 paragraphs.
The other two, of 333 paragraphs each, consisted of every other paragraph of
the first developmental sample.


                               TABLE VI

                TEST OF NUMBER OF PARAGRAPH CRITERION

(a) Four or More Paragraphs

                       Actual
     Assigned      U      C      S    Total
        U         72     35     21     128
        C         31     40     20      91
        S         28     27     58     113
      Total      131    102     99     332

     % Hits = 170/332 = 51.2

(b) Ten or More Paragraphs

                       Actual
     Assigned      U      C      S    Total
        U         79     36     26     141
        C         26     41     23      90
        S         26     25     50     101
      Total      131    102     99     332

     % Hits = 170/332 = 51.2

(c) Fifteen or More Paragraphs

                       Actual
     Assigned      U      C      S    Total
        U         80     36     29     145
        C         26     39     21      86
        S         25     27     49     101
      Total      131    102     99     332

     % Hits = 168/332 = 50.6

The 66 predictor variables described in Section 4 were computed for
each sample separately. The screening multiple regression technique,
described in Section 3.6, was applied to each group of predictors, and a set
of three regression equations was obtained from each of the three samples.

The sets were compared by applying them to the independent sample of
data. Application of a set results in three numbers for each paragraph, the
highest number determining the assignment of a security classification to
that paragraph. The assignments were verified by comparison with the actual
classifications. The results, presented in Table VII, show that the
percentage of hits of the larger sample (51.2) is higher than that obtained
from either of the two smaller samples (50.9 and 46.6, respectively).
Therefore, it was concluded that 333 paragraphs were insufficient for a
developmental sample.

                              TABLE VII

                  EFFECT OF SAMPLE SIZE ON ACCURACY

(a) Large Sample

                       Actual
     Assigned      U      C      S    Total
        U         64     26     11     101
        C         39     42     24     105
        S         28     34     64     126
      Total      131    102     99     332

     % Hits = 170/332 = 51.2

(b) First Small Sample

                       Actual
     Assigned      U      C      S    Total
        U         65     26      8      99
        C         33     45     33     111
        S         33     29     58     120
      Total      131    100     99     330*

     % Hits = 168/330 = 50.9

(c) Second Small Sample

                       Actual
     Assigned      U      C      S    Total
        U         52     30     17      99
        C         36     44     24     104
        S         42     27     58     127
      Total      130    101     99     330*

     % Hits = 154/330 = 46.6

*Two cases were inadvertently lost. However, this does not materially
influence the conclusion.


15. Shrinkage

While conducting the previous experiments, it was noted that the accuracy
of security assignments for paragraphs in the independent sample was
considerably less than the accuracy for paragraphs of the developmental
sample. Such loss in accuracy is termed "shrinkage." In experiment 2,
described in Section 11, the percentage of hits fell from 90.3 for the
developmental sample to 51.2 for the independent sample. To pursue this point
further, the set of three regression equations developed on the large sample
of paragraphs (see Section 5.5) was applied to this same sample. As before,
the highest of the three numbers obtained for each paragraph determined the
assignment of a security classification to that paragraph. Comparison of
assigned-versus-actual classifications is given in Table VIII(a). Table
VIII(b) is a repeat of Table VII(a), which gives the verification results of
applying this same set of equations to the independent set of paragraphs.
The percentage of hits drops from 92.9 on the developmental sample to 51.2 on
the independent sample. Although shrinkage is likely to occur in any similar
statistical analysis, the amount of shrinkage observed here is much greater
than usual.

                              TABLE VIII

          VERIFICATION OF LARGE-SAMPLE REGRESSION EQUATIONS

(a) Developmental Sample

                       Actual
     Assigned      U      C      S    Total
        U        195      4      6     205
        C          8    214      6     228
        S          7     16    207     230
      Total      210    234    219     663

     % Hits = 616/663 = 92.9

(b) Independent Sample

                       Actual
     Assigned      U      C      S    Total
        U         64     26     11     101
        C         39     42     24     105
        S         28     34     64     126
      Total      131    102     99     332

     % Hits = 170/332 = 51.2


The cause of the shrinkage lies in the method employed in forming the
predictor variables. An example serves to illustrate the difficulty.
Consider the values for paragraph i of MU100, MC100, and MS100 and assume that
paragraph i is secret. The explanation which follows would hold for any
paragraph of any security classification and any set of three variables.

The values $\mathrm{MU100}_i$, $\mathrm{MC100}_i$, and $\mathrm{MS100}_i$ were
obtained by summing the scores of all usable words in paragraph i. The scores
were computed by equations (IV-3), repeated below,

$$
\begin{aligned}
S_{ju} &= (N_{ju} - E_{ju}) / (E_{ju})^{1/2} \\
S_{jc} &= (N_{jc} - E_{jc}) / (E_{jc})^{1/2} \qquad \text{(IV-3)} \\
S_{js} &= (N_{js} - E_{js}) / (E_{js})^{1/2}
\end{aligned}
$$

where $N_{js}$ is the number of secret paragraphs in which word j appears. Now
(the crucial point) if word j appears in paragraph i, a secret paragraph,
then $N_{js}$ would tend to be larger than either $N_{ju}$ or $N_{jc}$ simply
because paragraph i contributes to $N_{js}$ but not to either of the other
two. Therefore, $S_{js}$ will tend to be larger than either $S_{ju}$ or
$S_{jc}$. In fact, all $S_{js}$ values entering into the formation of
$\mathrm{MS100}_i$ will tend to be larger than the corresponding values
entering into the formation of either $\mathrm{MU100}_i$ or
$\mathrm{MC100}_i$. Therefore, it is not surprising that $\mathrm{MS100}_i$
tends to be the largest of the three values. As illustrated by experiment 2
(Section 11), this tendency is strong enough to achieve a percentage of hits
of 90 on the developmental sample. This means that in 90 percent of the
paragraphs MA100 is larger than the other two predictors, where A is the
security classification of the paragraph (A = U, C, or S).

A method was devised to reduce the shrinkage. The developmental sample
of data was divided into two equal samples (designated as A and B) by taking
every other paragraph. Equations (IV-3) were applied to each sample separately
to compute word-scores. Thus, the same word has six scores attached to it,
three from sample A and three from sample B. Predictor variables were formed
as before (Section 9) by summing scores. However, for paragraphs in sample
B, word-scores from sample A are applied and vice versa. The end result is a
set of 66 predictor variables for every paragraph in the developmental sample.
The screening multiple regression technique was applied to these predictor
variables to obtain three regression equations.
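
The cross-scoring idea can be sketched as follows. The word-score used here (number of secret paragraphs containing the word) is a deliberately simple, hypothetical stand-in for equations (IV-3), and the four paragraphs are invented; the point is only that a paragraph never contributes to the scores applied to it.

```python
from collections import Counter
from typing import Dict, List, Tuple

Paragraph = Tuple[List[str], str]  # (usable words, actual classification)


def toy_secret_scores(sample: List[Paragraph]) -> Dict[str, int]:
    """Hypothetical stand-in for equations (IV-3): for each word, the number
    of secret paragraphs in the sample that contain it."""
    counts: Counter = Counter()
    for words, label in sample:
        if label == "S":
            counts.update(set(words))
    return dict(counts)


def summed_score(words: List[str], scores: Dict[str, int]) -> int:
    """Sum the scores over first occurrences of the words in one paragraph."""
    return sum(scores.get(word, 0) for word in set(words))


def cross_scored(paragraphs: List[Paragraph]) -> List[int]:
    """Score sample A paragraphs with sample B word-scores and vice versa."""
    scores_a = toy_secret_scores(paragraphs[0::2])  # every other paragraph
    scores_b = toy_secret_scores(paragraphs[1::2])
    return [
        summed_score(words, scores_b if index % 2 == 0 else scores_a)
        for index, (words, _label) in enumerate(paragraphs)
    ]


paragraphs = [
    (["radar", "range"], "S"),
    (["radar", "menu"], "S"),
    (["menu", "lunch"], "U"),
    (["range", "lunch"], "C"),
]
self_scored = [summed_score(w, toy_secret_scores(paragraphs)) for w, _ in paragraphs]
cross = cross_scored(paragraphs)
```

With self-scoring, the two secret paragraphs receive inflated sums because their own words feed the scores; with cross-scoring that advantage disappears.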

To measure the shrinkage, the equations were applied first to samples
A and B, which together constitute the developmental sample, and then to the
independent sample. To apply the equations to the independent sample of
paragraphs, the word-scores developed on A and those developed on B were
averaged. That is, a word appearing in a paragraph of the independent sample
first had six scores attached to it. These were reduced to three by averaging
the corresponding sample A and sample B scores. After averaging, the predictor
variables were formed as before by summing, or counting, over usable words of
a paragraph.
As in Section 14, the highest of three resulting values for a paragraph
determines the assignment of a security classification to that paragraph.
The assignments were verified by comparison with the actual classifications,
and the results are given in Table IX(a) for the developmental
sample and Table IX(b) for the independent sample. The percentages of hits
are 52.9 and 51.1 for the developmental and independent samples, respectively.

Although the method was quite successful in reducing shrinkage, it did
not succeed in raising the absolute level of the accuracy of security
assignments to the paragraphs in the independent sample. This can be seen by
comparing Table VIII(b) with Table IX(b). The first measures accuracy on the
independent sample using the usual type predictors, whereas the second measures
accuracy of the "shrink-resistant" predictors. The percentages of hits are
almost the same, and the tables are quite similar. This was quite a
disappointment to us. Nevertheless, the method is considered to possess
considerable merit because it leads to much more realistic estimates of the
accuracy to be expected on an independent sample of data. In many problems,
independent data may not be available or may be too costly to collect and
process.
                               TABLE IX

      VERIFICATION OF SHRINKAGE RESISTANT REGRESSION EQUATIONS

(a) Developmental Sample

                       Actual
     Assigned      U      C      S    Total
        U         90     53     31     174
        C         61    110     38     209
        S         55     72    149     276
      Total      206    235    218     659*

(b) Independent Sample

                       Actual
     Assigned      U      C      S    Total
        U         59     28     14     101
        C         31     40     16      87
        S         40     32     69     141
      Total      130    100     99     329*

*A few cases are omitted because of the method of computation.
16. Application of the REEP Technique - Simple Dummying

The regression estimation of event probabilities (REEP) statistical
technique was used to develop methods for assigning security classifications.
It will be recalled (Section 6) that the first step in the REEP procedure
is to generate a set of dummy predictor variables from each of the continuous
type predictor variables. Such generation can be done in a number of ways.
Two were attempted; one is reported upon in this section and the other in
the next section.

As a result of the previously discussed experiments, the following rules
were used to form 66 continuous predictor variables:

(a)  The entire set of 998 paragraphs was divided into a
     developmental sample of 666 paragraphs and an
     independent sample of 332 paragraphs by extracting every
     third paragraph to form the independent sample.

(b)  Equations (IV-3) were applied to the entire
     developmental sample of paragraphs to compute three scores
     for each non-function word and word-pair which
     appeared in at least four paragraphs.

(c)  The method of forming predictor variables was as
     described in Section 9; i.e., scores of usable words
     or word-pairs were summed or counted and only the first
     appearance of a word or word-pair in a paragraph was used.

In this first REEP experiment, just one dummy predictor variable was
generated from each of the 66 predictor variables. The method of generation
was the same in all cases: a dummy predictor variable takes on a value of one
if the value of the original variable is greater than its mean and is zero if
the value is equal to or smaller than the mean. The REEP procedure was applied
to the resulting 66 dummy predictor variables.
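
A minimal sketch of this simple dummying rule. The values are invented, and taking the mean over the developmental sample is an assumption of the sketch; the text says only that each variable is compared with "its mean."

```python
from typing import List, Sequence


def mean_dummies(developmental: Sequence[float],
                 values: Sequence[float]) -> List[int]:
    """Return 1 where a value exceeds the developmental-sample mean, else 0."""
    mean = sum(developmental) / len(developmental)
    return [1 if v > mean else 0 for v in values]


# Invented values of one continuous predictor.
developmental = [1.0, 2.0, 3.0, 6.0]   # mean is 3.0
new_values = [3.0, 3.5, -1.0]

dummies = mean_dummies(developmental, new_values)
```

A value exactly equal to the mean maps to zero, matching the "equal to or smaller" clause above.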

As indicated in Section 6, the end-product of such an application is a
method for computing the probabilities that a paragraph is unclassified,
confidential, or secret. The largest of these three probabilities determined
the assignment of a security classification to a paragraph. Assignments were
made in this way to all paragraphs in the independent sample. The assignments
were verified as before by comparison with actual classifications. The results
are given in Table X.


                               TABLE X

              ACCURACY OF ASSIGNMENT BY REEP TECHNIQUE
                  INDEPENDENT SAMPLE OF PARAGRAPHS

                       Actual
     Assigned      U      C      S    Total
        U         67     35     19     121
        C         36     41     20      97
        S         28     26     60     114
      Total      131    102     99     332


The percentage of hits is only 50.6%. This is lower than the percentage
achieved by the regression equations of Section 15 [see Table VIII(b)]. It is
not even as good as the accuracy attained by the very simple procedure
described in Section 11 (see Table IV).

The cause of such low accuracy was hypothesized to be the rather simple
method we employed to generate the dummy predictor variables. Therefore,
another method was devised and is reported upon in the next section.
17. Application of REEP Technique - Final System

The second method for generating dummy predictor variables is to find
K-1 numbers which separate an original predictor variable into K groups so
that prespecified percentages of the total number of cases fall into each
group. The allowable percentages for K = 5 and K = 6 are listed in Table XI.
The percentages were obtained from Cox [2] and Bryan and Southan [1], who
have devised a method for dividing a continuous variable into K groups such
that the grouping error is minimized for a stated K.

                               TABLE XI

    PERCENTAGE FREQUENCY DISTRIBUTION OF OBSERVATIONS IN K DUMMIES

      K            Percentage of Total Observations

      5        10.9    23.7    30.8    23.7    10.9
      6         7.4    18.1    24.5    24.5    18.1     7.4

Eleven dummy predictor variables were generated from an original variable.
This was accomplished by ranking the 666 values of a variable in numerical
order from lowest to highest value and counting up to the required percentage
of observations. For example, for K = 5 the first number found, call it
$L_1$, is the value of the (.109) x (666) = 73rd ranked observation of the
variable; the second number, $L_2$, is the value of the
(.109 + .237) x (666) = 230th observation; and so on for $L_3$ and $L_4$.
Once the four L-values are obtained, they are used to generate 5 dummy
predictor variables where

$$
\begin{aligned}
\text{Dummy 1} &= 1 \text{ if } X \le L_1; && \text{otherwise dummy 1} = 0 \\
\text{Dummy 2} &= 1 \text{ if } L_1 < X \le L_2; && \text{otherwise dummy 2} = 0 \\
\text{Dummy 3} &= 1 \text{ if } L_2 < X \le L_3; && \text{otherwise dummy 3} = 0 \\
\text{Dummy 4} &= 1 \text{ if } L_3 < X \le L_4; && \text{otherwise dummy 4} = 0 \\
\text{Dummy 5} &= 1 \text{ if } L_4 < X; && \text{otherwise dummy 5} = 0
\end{aligned}
$$

This same procedure was used to generate another set of six dummy variables
for each predictor by using the percentages listed in the last row of
Table XI. These are designated as dummy 6, dummy 7, ..., dummy 11.
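
The ranking-and-counting procedure can be sketched as below. The function names and the rounding of fractional ranks are assumptions of this sketch, and the 666 values are invented (the integers 1 to 666), so that each class limit equals its rank.

```python
from bisect import bisect_left
from typing import List, Sequence

K5 = (10.9, 23.7, 30.8, 23.7, 10.9)          # Table XI, K = 5
K6 = (7.4, 18.1, 24.5, 24.5, 18.1, 7.4)      # Table XI, K = 6


def quantile_limits(values: Sequence[float],
                    percents: Sequence[float]) -> List[float]:
    """Return the K-1 class limits L_1, ..., L_{K-1} for one variable."""
    ordered = sorted(values)
    n = len(ordered)
    limits = []
    cumulative = 0.0
    for p in percents[:-1]:
        cumulative += p
        rank = max(1, round(cumulative / 100 * n))  # 1-based rank
        limits.append(ordered[rank - 1])
    return limits


def dummies(x: float, limits: Sequence[float]) -> List[int]:
    """One-hot group membership: group g holds L_g < x <= L_{g+1}."""
    group = bisect_left(limits, x)  # number of limits strictly below x
    return [1 if g == group else 0 for g in range(len(limits) + 1)]


values = list(range(1, 667))
limits5 = quantile_limits(values, K5)
limits6 = quantile_limits(values, K6)
eleven = dummies(100, limits5) + dummies(100, limits6)  # dummies 1 to 11
```

The first two limits found for K = 5 fall at ranks 73 and 230, in agreement with the worked figures in the text.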

The entire set of 66 variables was not used to generate dummy variables.
This would have resulted in 11 x 66 = 726 dummy predictor variables, and the
REEP computer program cannot handle this many variables. It is our judgment,
however, that very little is lost by this reduction because the predictor
variables are very highly correlated. A 66 x 66 matrix of correlation
coefficients was computed between each variable and every other variable. The
correlations were quite high; and from inspection of this matrix the 30
variables listed in Table XII were chosen to generate dummy predictor
variables.

                              TABLE XII

        VARIABLES USED TO GENERATE DUMMY PREDICTOR VARIABLES*

               MU100          FU66           HU1
               MC100          FC66           HC1
               MS100          FS66           HS1
               MU66           FU50           HU3
               MC66           FC50           HC3
               MS66           FS50           HS3
               MU50           FU34           HU5
               MC50           FC34           HC5
               MS50           FS34           HS5
               MU34
               MC34
               MS34

*See Section 4 for definitions of the variables.
There were 11 x 30 = 330 dummy predictor variables.

The REEP technique was applied. The dummy predictor variables selected
by the technique and the equations generated are shown in Table XIII.

                              TABLE XIII

                  DUMMY PREDICTORS SELECTED BY REEP

                  Predictors                     Predictand
            Original   Dummy No.   Unclassified  Confidential     Secret

  B(0)                                0.15546       0.60987      0.23466
  B(1)       MU100        11          0.38873      -0.33865     -0.05007
  B(2)       MU100        12          0.35807      -0.28741     -0.07066
  B(3)       MS100        11         -0.08246      -0.17737      0.25983
  B(4)       MS100        12         -0.08028      -0.27587      0.35615
  B(5)       MS100         4         -0.09212      -0.25260      0.34473
  B(6)       MU100        10          0.36366      -0.22026     -0.14340
  B(7)       MS100         4         -0.08509      -0.30858      0.39367
  B(8)       MU100         5          0.46329      -0.29042     -0.17288
  B(9)       MC100         4         -0.15311       0.38555     -0.23243
  B(10)      MC100         5         -0.17295       0.40656     -0.23361
  B(11)      FS50          4         -0.01907      -0.10652      0.12559
  B(12)      MC100         8          0.06903      -0.10464      0.03561
  B(13)      MU100         4          0.08837      -0.11252      0.02415

Inspection of Table XIII indicates that 13 dummy predictors were selected,
and that all but one of them originated from predictor variables formed by
summing all usable words of a paragraph. This has the quite interesting
implication that it is not simply the appearance of one or two strong words
which governed the original classification of the paragraphs but rather the
totality of the words.

The regression equations listed in Table XIII appear to be quite reasonable.
Consider the equation for the unclassified predictand. The first predictor
selected is MU100, dummy 11, and its coefficient is +0.38873. This variable
takes on a value of one if MU100 is in the next to highest of six categories
(i.e., MU100 lies between two successive category limits L). Therefore, the
probability that a paragraph is unclassified is increased by 0.38873 if MU100
is high. Further inspection reveals that the probability of a paragraph being
unclassified decreases if either MS100 or MC100 is high. This too is quite
reasonable.
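
As a worked illustration of how the equations of Table XIII combine, the
sketch below adds the B(1) coefficients to the B(0) constants. It assumes,
hypothetically, that dummy 11 of MU100 is the only selected dummy predictor
equal to one for the paragraph; the variable names are ours and are not part
of the REEP program.

```python
# Constants B(0) and coefficients B(1) copied from Table XIII.
intercepts = {"U": 0.15546, "C": 0.60987, "S": 0.23466}
mu100_dummy11 = {"U": 0.38873, "C": -0.33865, "S": -0.05007}

# Hypothetical paragraph: only MU100 dummy 11 is "on", so each estimate
# is the constant plus that single coefficient.
estimates = {c: round(intercepts[c] + mu100_dummy11[c], 5) for c in intercepts}
assigned = max(estimates, key=estimates.get)

print(estimates)  # {'U': 0.54419, 'C': 0.27122, 'S': 0.18459}
print(assigned)   # U
```

Under this assumption the three estimates sum to one, and the paragraph
would be assigned to the unclassified category.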

The three equations listed in Table XIII were applied to each paragraph
of the independent sample, and the highest of the three resulting values for
a paragraph determined the assignment of a security classification to that
paragraph. The verification results are shown in Table XIV. The percentage
of hits was 53.9, which is higher than any of the percentages achieved
heretofore. This agreed with our expectations, since our experience in previous
investigations of a similar nature has indicated that REEP is the most logical
and natural technique to use.


TABLE  XIV 

VERIFICATION OF REEP EQUATIONS ON INDEPENDENT SAMPLE

                        Actual
Assigned        U       C       S     Total

   U           61      22       9       92
   C           43      59      31      133
   S           27      21      59      107

 Total        131     102      99      332
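
The percentage of hits quoted above can be recovered from the counts of
Table XIV. The following is a minimal sketch of that arithmetic; the function
name is ours, and rows are taken to be the assigned classes with columns the
actual classes.

```python
# Counts from Table XIV: rows are assigned classes (U, C, S),
# columns are actual classes (U, C, S).
table = [
    [61, 22, 9],
    [43, 59, 31],
    [27, 21, 59],
]

def percent_hits(matrix):
    """Return the percentage of cases lying on the main diagonal."""
    hits = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * hits / total

hit_rate = percent_hits(table)
print(round(hit_rate, 1))  # 53.9
```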

Several paragraphs misclassified by the final REEP system were examined
subjectively in an attempt to determine causes of error. Ten paragraphs
assigned a high probability of being secret but which actually were
unclassified, and five paragraphs assigned unclassified but actually secret,
were presented to two scientists with long experience in chemical-biological
warfare problems. Both scientists in the past have been charged with assigning
security classifications.

Independently, both scientists could see nothing in any of the five actually
secret paragraphs which they felt would cause the paragraphs to be secret. For
the ten actually unclassified paragraphs, the scientists agreed that they should
be unclassified. Thus, the two scientists agreed in one case and disagreed in
another with the original classifiers. The examination did not reveal any
obvious means of improving the REEP system.

SECTION VI


CONCLUSIONS  AND  RECOMMENDATIONS 

An automatic computer method was devised for assigning a security
classification (unclassified, confidential, or secret) to a paragraph. The
assignment depends entirely on scores attached to the words of the paragraph.
The method was tested on an independent sample of 332 paragraphs, and 53.9
percent were correctly classified. Application of the binomial test indicates
that 53.9 percent is statistically significantly greater than the 33.3 percent
which could be achieved by chance. Therefore, it is concluded that the method
does show skill.
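
The binomial test mentioned above can be sketched as follows. The details of
the test actually applied are not given here, so this is only an illustration
under stated assumptions: a one-sided exact test, a chance hit probability of
one third, and 179 hits (the diagonal of Table XIV, 61 + 59 + 59) out of 332
paragraphs.

```python
from math import comb

def binomial_upper_tail(n, k, p):
    """Exact P(X >= k) for X distributed Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_paragraphs = 332  # size of the independent sample
n_hits = 179        # correctly classified paragraphs (53.9 percent)

# Probability of doing at least this well by chance alone.
p_value = binomial_upper_tail(n_paragraphs, n_hits, 1.0 / 3.0)
print(p_value < 0.001)  # True
```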

Although the automatic method is better than chance, it is not known how
it would fare when compared with the present subjective method of assigning
security classifications. It is our opinion that the skill is too low to
consider replacing the present subjective method of classification. To test
this opinion, it is recommended that a study be conducted to measure the skill
of the subjective forecaster. The simplest way to do this would be to have two
persons independently classify the same set of paragraphs. The amount of
matching of the two classifications would be a measure of skill. If this
amount exceeds 53.9 percent then the subjective classification is better than
the automatic method. Use of more than two classifiers would enhance the
confidence in the results. In either case, such an experiment would not only
provide a benchmark for measuring the success of objective methods, but it
would provide valuable insight into present methods of security classification.
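
The matching measure proposed above could be computed as in the sketch below.
The two label lists are hypothetical and serve only to show the calculation;
no such experiment was conducted in this study.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of paragraphs given the same label by two classifiers."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both classifiers must label the same paragraphs")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)

# Hypothetical classifications of eight paragraphs by two persons.
first = ["U", "C", "S", "S", "U", "C", "U", "S"]
second = ["U", "C", "C", "S", "U", "S", "U", "S"]

agreement = percent_agreement(first, second)
print(agreement)         # 75.0
print(agreement > 53.9)  # True: exceeds the automatic method's hit rate
```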

Even if the objective method turned out to be not as good as the subjective
method, it might prove quite useful in providing guidance to a classifier. To
this end, it is recommended that two experiments be conducted. The first is
rather simple whereas the second is more complicated and costly but offers
greater potential usefulness.

The first experiment entails use of the probabilities produced by the
REEP technique. It will be recalled that the REEP procedure results in the
probabilities that a paragraph belongs to each of the three security
classifications. The experiment consists of supplying such probabilities to a
number of classifiers for a period of time and then surveying them to determine
their opinion of the usefulness of the probabilities. An alternative is to have
a group of classifiers assign two security classifications to a set of
paragraphs, the first without using the probabilities and the second after
seeing the probabilities. Appropriate verification procedures could then
determine whether or not the probabilities are useful.

The second, and more elaborate, experiment involves a "sterile environment"
classification. It is generally agreed that there is a tendency to overclassify
documents since there is no penalty for overclassification as there is for
underclassification. A group of classifiers would be asked to assign
classifications anonymously to the same sample of paragraphs. The consensus
would be the "correct" classification. The objective technique would be
developed on this sample of paragraphs. Now, the objective technique would be
applied to a new set of paragraphs and the resulting probabilities would be
provided to a classifier as guidance. The worth of the guidance information
would be evaluated as before. Hopefully, this procedure would reduce the amount
of overclassification. A by-product of such an experiment would be a comparison
of each classifier against the consensus to determine the number of matches,
i.e., the first experiment recommended would be a by-product of this experiment.
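
How a consensus might be formed is not specified above. The sketch below
assumes, purely for illustration, that the consensus is the label chosen by
the largest number of classifiers and that a tie yields no consensus; the
classifier names and labels are hypothetical.

```python
from collections import Counter

def consensus(labels):
    """Most frequent label, or None when the two top counts are tied."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def matches_with_consensus(assignments):
    """assignments maps classifier name -> list of labels, one per paragraph."""
    names = list(assignments)
    n_paragraphs = len(assignments[names[0]])
    agreed = [consensus([assignments[name][i] for name in names])
              for i in range(n_paragraphs)]
    # Count, for each classifier, the paragraphs on which the label given
    # equals the consensus (paragraphs without a consensus are skipped).
    matches = {
        name: sum(1 for given, c in zip(assignments[name], agreed)
                  if c is not None and given == c)
        for name in names
    }
    return agreed, matches

assignments = {
    "A": ["U", "C", "S", "S"],
    "B": ["U", "S", "S", "C"],
    "C": ["C", "S", "S", "U"],
}
agreed, matches = matches_with_consensus(assignments)
print(agreed)   # ['U', 'S', 'S', None]
print(matches)  # {'A': 2, 'B': 3, 'C': 2}
```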

It is our firm opinion that the methodology developed and applied during
the course of the study has extracted substantially all of the information
contained in word and word-pair frequencies. It is recommended that subsequent
security classification investigations concentrate on other types of
information, e.g., meanings of words, context surrounding the paragraph, rules
used to assign security classifications, etc.

It is strongly recommended that the methods be applied to develop
objective methods for indexing textual material. From a strictly
methodological viewpoint, the three security classifications could have been
any three categories (history, biology, mathematics; or number theory,
calculus, topology) and the procedures would have been the same. Of particular
interest in the methodology are (a) the method of scoring words and combining
the scores, and (b) the flexibility of the REEP statistical technique. As far
as is known, neither has been applied before in automatic indexing procedures.
The usual method is to choose "key" words to do the indexing. The proposed
methodology would permit both key words and word scores to be presented to the
REEP technique in a large variety of ways for objective selection and optimum
organization of useful information into an automatic-computer indexing system.

APPENDIX


A.0 ITEMS FORWARDED TO SPONSORING AGENCY

Under terms of the contract, certain items were furnished to the
Sponsoring Agency. The items are described briefly below.

A.1 Data

A.1.1 Cards

998  paragraphs  were  punched  onto  IBM  cards  following  the  rules  given  in 
Section  2.  The  procedures  of  Section  4.  were  followed  to  edit  the  card 
data  and  insert  corrected  cards  when  necessary.  The  corrected  data  consist 
of  some  13,000  cards.  There  are  two  types  of  cards,  a  header  card  at  the 
beginning  of  each  paragraph  followed  by  the  data  cards  for  that  paragraph. 

The format of the header card is:

Columns    Description

1-2        $$
3          Security classification of the paragraph;
           U = unclassified, C = confidential, S = secret.
4-6        Paragraph number, 1, 2, ..., 998. Paragraphs were
           numbered consecutively as they were extracted
           from the documents listed in Section 2.6.

Punching  of  the  data  cards  starts  in  column  1  and  continues  through  column 
70  onto  column  1  of  the  next  card  through  column  70,  etc.  One  word  can  overlap 
two  cards.  Each  paragraph  starts  on  a  new  card.  Columns  71-80  are  left  blank. 
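
The card layout described above can be read back as in the following sketch.
It is not the program used in this study; it assumes each card is available as
a text line of up to 80 columns, and the function name and sample deck are
hypothetical.

```python
def read_paragraphs(cards):
    """Group card images into (classification, number, text) tuples."""
    paragraphs = []
    header = None
    fields = []
    for card in cards:
        image = card.ljust(80)  # restore trailing blanks on short lines
        if image.startswith("$$"):
            # Header card: column 3 is the class, columns 4-6 the number.
            if header is not None:
                paragraphs.append(header + ("".join(fields).rstrip(),))
            header = (image[2], int(image[3:6]))
            fields = []
        elif header is not None:
            # Data card: only columns 1-70 carry text. Fields are joined
            # without a separator because a word may overlap two cards.
            fields.append(image[:70])
    if header is not None:
        paragraphs.append(header + ("".join(fields).rstrip(),))
    return paragraphs

# Hypothetical two-card paragraph (98 characters of text).
text = "RESULTS OF THE FIELD TRIAL ARE SUMMARIZED BELOW. " * 2
deck = ["$$U  7"] + [text[i:i + 70] for i in range(0, len(text), 70)]
paragraphs = read_paragraphs(deck)
print(paragraphs[0][:2])  # ('U', 7)
```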

A.1.2 Raw Data Tape

The cards were processed by the RAW DATA TAPE GENERATOR PROGRAM described
in Section A.2.2 to produce the Raw Data Tape. The tape is unlabelled and was
written on an IBM 360 using logical IOCS. The tape is 9-track, 800 bytes per
inch, fixed format of 32 bytes per record and 50 records per block. The
record format follows:


Columns   Format                       Description

1-8       Binary Coded Decimal (BCD)   Primary word
9-16      BCD                          Secondary word

                                       The primary and secondary words are the
                                       first and second words, respectively, of
                                       a word-pair. See Section 2.3.2 for a
                                       definition of word-pair. Words at the
                                       end of a sentence are not paired so the
                                       secondary word is left blank. Only the
                                       first six characters of a word are used
                                       and these are left-adjusted.

17-20     Integer (I)                  Paragraph number
21-24     I                            Sentence number within paragraph.
25-28     I                            Number of the IBM card (see Section
                                       A.1.1) on which primary word begins.
29-32     I                            Classification of paragraph; 1 =
                                       unclassified, 2 = confidential, 3 =
                                       secret.


There is one record for each non-function word appearing in the entire
sample of 998 paragraphs. (See Section 3.3 for definition of non-function
words.) A word appearing N times will generate N records. The order of the
records is by word within paragraph, i.e., first word of first paragraph, second
word of first paragraph, ..., last word of last paragraph. There are 65,352
records.
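
The 32-byte record layout can be illustrated with the sketch below. Several
details are assumptions rather than facts from this report: the character
fields are encoded with the EBCDIC code page cp037, the integers are 32-bit
big-endian, and the function names are ours.

```python
import struct

# Two 8-byte character fields followed by four 4-byte integers (32 bytes).
RAW_RECORD = struct.Struct(">8s8s4i")

def pack_raw_record(primary, secondary, paragraph, sentence, card,
                    classification, encoding="cp037"):
    def field(word):
        # First six characters only, left-adjusted and blank-padded.
        return word[:6].ljust(8).encode(encoding)
    return RAW_RECORD.pack(field(primary), field(secondary),
                           paragraph, sentence, card, classification)

def unpack_raw_record(data, encoding="cp037"):
    primary, secondary, paragraph, sentence, card, classification = \
        RAW_RECORD.unpack(data)
    return (primary.decode(encoding).rstrip(),
            secondary.decode(encoding).rstrip(),
            paragraph, sentence, card, classification)

# Hypothetical record: last word of a sentence, so no secondary word.
record = pack_raw_record("TEMPERATURE", "", 12, 3, 157, 2)
print(len(record))                # 32
print(unpack_raw_record(record))  # ('TEMPER', '', 12, 3, 157, 2)
```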


The  tape  is  classified  SECRET. 

A.1.3 Basic Data Tapes

There are two Basic Data Tapes, one for the 666 paragraphs of the dependent
or developmental sample and the second for the 332 paragraphs of the independent
or test sample. Both tapes are unlabelled and written on an IBM 360 using logical
IOCS. The tapes are 9-track, 800 bytes per inch, fixed format of 44 bytes
per record and 30 records per block. The record format follows:

Columns   Format               Description

1-4       Integer (I)          Primary word number*
5-8       I                    Secondary word number*
9-12      I                    Number of times word or word-pair appears in
                               paragraph.
13-16     Floating Point (F)   S_{ju}  Unclassified score attached to word or
                               word-pair
17-20     F                    S_{jc}  Confidential score
21-24     F                    S_{js}  Secret score (see equations (4-3),
                               Section 4.1 for definitions of scores).
25-28     I                    N_{ju}  The number of different unclassified
                               paragraphs in which the word or word-pair
                               appears at least once.
29-32     I                    N_{jc}  Same as N_{ju} but for confidential
                               paragraphs.
33-36     I                    N_{js}  Same as N_{ju} but for secret paragraphs.
37-40     I                    Paragraph number
41-44     I                    Paragraph classification; 1 = unclassified,
                               2 = confidential, 3 = secret.

*The  actual  words  are  not  used.  Instead,  numbers  are  assigned  to  each  primary 
word  and  secondary  word.  This  allows  a  security  classification  of  UNCLASSIFIED 
which  provides  access  to  the  computer  at  all  times. 

There  is  one  record  for  each  usable  word  in  a  paragraph  (see  Section  4.2.1 
for  a  definition  of  usable).  There  are  32,548  records  on  the  developmental 
sample  Basic  Data  Tape  and  15,485  on  the  independent  sample  tape.  Both  tapes 
are  UNCLASSIFIED. 
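
A 44-byte Basic Data Tape record could be decoded as sketched below. The
report does not state the floating-point representation; this sketch assumes
the single-precision hexadecimal format native to the IBM System/360 (sign
bit, 7-bit excess-64 exponent of 16, 24-bit fraction) and big-endian 32-bit
integers. Function and field names are hypothetical.

```python
import struct

# Three integers, three 4-byte floats (read as raw words), five integers.
BASIC_RECORD = struct.Struct(">3i3I5i")

def ibm_to_float(word):
    """Convert a 32-bit System/360 hexadecimal float to a Python float."""
    sign = -1.0 if word & 0x80000000 else 1.0
    exponent = (word >> 24) & 0x7F
    fraction = (word & 0x00FFFFFF) / float(1 << 24)
    return sign * fraction * 16.0 ** (exponent - 64)

def unpack_basic_record(data):
    (primary, secondary, times, s_u, s_c, s_s,
     n_u, n_c, n_s, paragraph, classification) = BASIC_RECORD.unpack(data)
    return {
        "primary": primary, "secondary": secondary, "times": times,
        "S_ju": ibm_to_float(s_u), "S_jc": ibm_to_float(s_c),
        "S_js": ibm_to_float(s_s),
        "N_ju": n_u, "N_jc": n_c, "N_js": n_s,
        "paragraph": paragraph, "classification": classification,
    }

# Hypothetical record with scores 1.0, 0.5 and 0.0.
raw = BASIC_RECORD.pack(101, 0, 2, 0x41100000, 0x40800000, 0x00000000,
                        9, 4, 1, 37, 1)
fields = unpack_basic_record(raw)
print(fields["S_ju"], fields["S_jc"], fields["S_js"])  # 1.0 0.5 0.0
```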

A.2 Computer Programs


A.2.1 Statistical Techniques

Computer programs for applying the two statistical techniques and for
forming dummy predictor variables were written prior to this contract. (See
Section 3 for a description of the techniques and of the dummying procedure.)

The programs are on the TRC Statistical Program Tape and are described in a
manual prepared for the U.S. Air Force [6]. A copy of the tape and five
copies of the manual were furnished to the Sponsoring Agency.

A.2.2  Raw  Data  Tape  Generator 

The objective of the program is to generate the Raw Data Tape described
in Section A.1.2. This is accomplished by processing the cards described in
Section A.1.1. Each paragraph is treated as one long serial string of
characters. A word is isolated as a successive group of characters preceded and
followed by either a blank or end-of-sentence mark. The word extremes,
beginning and end, are then stripped of non-permissible characters. The stripped
word is set to a maximum length of six characters by dropping all characters
after the first six. Stripped words with less than six characters are
left-adjusted and padded with blanks.

A function table search is executed and function words are eliminated.
See Section 3.3 for a list of function words. Each non-function word
generates one record on the output tape. The format of such records is
described in Section A.1.2. Words are isolated and processed in this manner
until the set of input cards is exhausted.
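
The word-isolation steps described above can be sketched as follows. This is
an illustration only and not the assembler program itself. The function-word
list, the set of permissible characters (letters and digits), and the use of a
period as the only end-of-sentence mark are assumptions made for the example.

```python
FUNCTION_WORDS = {"THE", "OF", "AND", "A", "IN", "TO", "IS", "WAS"}  # hypothetical
PERMISSIBLE = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")            # assumed

def strip_extremes(token):
    """Remove non-permissible characters from both ends of a token."""
    start, end = 0, len(token)
    while start < end and token[start] not in PERMISSIBLE:
        start += 1
    while end > start and token[end - 1] not in PERMISSIBLE:
        end -= 1
    return token[start:end]

def isolate_words(paragraph):
    """Return the six-character words that would each generate a record."""
    words = []
    # Blanks and end-of-sentence marks both delimit words.
    for token in paragraph.upper().replace(".", " ").split():
        word = strip_extremes(token)[:6]  # keep the first six characters
        if not word or word in FUNCTION_WORDS:
            continue                      # function table search
        words.append(word.ljust(6))       # left-adjust, pad with blanks
    return words

print(isolate_words("The agent was stored in (sealed) containers."))
# ['AGENT ', 'STORED', 'SEALED', 'CONTAI']
```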

The program counts the number of non-function words, the number of function
words, and the total number of words in each paragraph. The mean and standard
deviation of each of these three quantities are obtained separately for the
unclassified, confidential, and secret paragraphs. These 18 values are
printed. Under program option each record placed onto tape can also be
printed.

The program is written in assembler language for the IBM 360 Mod. 40.
Input-output operations are overlapped for efficiency.

A.2.3 Basic Data Tape Program

The program produces the two tapes described in Section A.1.3. The
records on the Raw Data Tape, described in Section A.1.2, have previously
been alphabetized first by primary word then by secondary word within
primary word. For each word and word-pair, counts are made to obtain N_{j},
N_{ju}, N_{jc}, N_{js} which are, respectively, the total number of paragraphs
in which word j appears at least once and the number of such paragraphs which
are unclassified, confidential, and secret. The program also computes the
number of times each word or word-pair appears in each paragraph. For those
words and word-pairs for which N_{j} equals or exceeds four, equations (IV-3),
Section 8, are applied to compute S_{ju}, S_{jc}, S_{js}. The information is
placed onto tape in the format given in Section A.1.3.
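
The counting step can be sketched as below. The sketch covers the paragraph
counts and the within-paragraph frequencies only; the scoring equations are
not reproduced. It assumes that the threshold of four applies to the total
paragraph count for a word, and the input records and names are hypothetical.

```python
from collections import Counter, defaultdict

def paragraph_counts(records):
    """records: iterable of (word, paragraph number, classification 1-3)."""
    seen = defaultdict(dict)  # word -> {paragraph: classification}
    occurrences = Counter()   # (word, paragraph) -> times word appears
    for word, paragraph, classification in records:
        seen[word][paragraph] = classification
        occurrences[(word, paragraph)] += 1
    counts = {}
    for word, paragraphs in seen.items():
        by_class = Counter(paragraphs.values())
        counts[word] = {
            "N_j": len(paragraphs),  # paragraphs containing the word
            "N_ju": by_class[1],     # ... of which unclassified
            "N_jc": by_class[2],     # ... confidential
            "N_js": by_class[3],     # ... secret
        }
    return counts, occurrences

def scorable(counts, minimum=4):
    """Words appearing in at least `minimum` paragraphs."""
    return sorted(w for w, c in counts.items() if c["N_j"] >= minimum)

records = [
    ("AGENT", 1, 1), ("AGENT", 1, 1), ("AGENT", 2, 3), ("AGENT", 3, 3),
    ("AGENT", 4, 2), ("TEST", 1, 1), ("TEST", 2, 3),
]
counts, occurrences = paragraph_counts(records)
print(counts["AGENT"])  # {'N_j': 4, 'N_ju': 1, 'N_jc': 1, 'N_js': 2}
print(scorable(counts))  # ['AGENT']
```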

In  addition  to  producing  the  two  tapes,  the  program  has  three  print 
options  which  permit  it  to  be  used  for  other  purposes. 


a.  The program can print those words which appear less than
M times, where M is an input value, in the entire sample
of data. This is useful for locating misspelled words.


b.  Each record can be printed as it is placed onto the output
tape. It is to be noted that such printing includes
the number of times each word appears in each paragraph
so that the program is a frequency-count program.

c.  The total number of times a word appears in the entire
sample of data can also be printed. (This is not the
same as N_{j}, the total number of paragraphs in which
word j appears.)

The program is written in assembler language for the IBM 360 Mod. 40.
Input-output is overlapped for efficiency.

A.2.4 Prediction Program

The output of both the screening multiple regression program and the
regression estimation of event probabilities program described in Section
A.2.1 consists of three equations of the form

    \hat{Y}_{ui} = A_{u0} + A_{u1} X_{u1i} + A_{u2} X_{u2i} + \cdots + A_{uP} X_{uPi}        (A-1)

\hat{Y}_{ui} is an estimate of the probability that paragraph i is
unclassified. The X's are predictor variables selected by the programs,
continuous type predictors by screening regression and dummy predictor
variables by REEP. The A's are regression coefficients computed by least
squares. Two similar equations estimate, respectively, the probabilities that
paragraph i is confidential or secret.

The prediction program applies the three equations to each of a sample of
paragraphs. The largest of the three estimates \hat{Y}_{ui}, \hat{Y}_{ci},
\hat{Y}_{si} determines the security classification assigned to paragraph i.
The assignments are matched with the actual classifications and a three by
three contingency table is formed and printed. See Section V for several
examples of contingency tables.
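
The logic of the prediction program can be sketched as follows. This is not
the FORTRAN IV program; the coefficients and sample are hypothetical, and for
simplicity the same predictor values are supplied to all three equations.
Ties are resolved in favor of the first class in the order U, C, S, which is
also an assumption.

```python
CLASSES = ("U", "C", "S")

def estimate(coefficients, predictors):
    """Constant term plus the sum of coefficient times predictor."""
    return coefficients[0] + sum(
        a * x for a, x in zip(coefficients[1:], predictors))

def assign(equations, predictors):
    """Class whose equation gives the largest estimate."""
    estimates = {c: estimate(equations[c], predictors) for c in CLASSES}
    return max(CLASSES, key=lambda c: estimates[c])

def contingency_table(equations, sample):
    """table[assigned][actual] = number of paragraphs."""
    table = {a: {b: 0 for b in CLASSES} for a in CLASSES}
    for predictors, actual in sample:
        table[assign(equations, predictors)][actual] += 1
    return table

# Hypothetical equations with two dummy predictors each.
equations = {
    "U": [0.2, 0.5, -0.1],
    "C": [0.5, -0.3, -0.2],
    "S": [0.3, -0.2, 0.3],
}
sample = [([1, 0], "U"), ([0, 1], "S"), ([0, 0], "C"), ([1, 1], "C")]
table = contingency_table(equations, sample)
print(table["U"])  # {'U': 1, 'C': 1, 'S': 0}
```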

The program is written in FORTRAN IV for the IBM 360 Mod. 40 computer.